Introducing taxastand and dwctaxon

A pair of R packages for standardizing species names in Darwin Core format

Joel Nitta, Wataru Iwasaki
The University of Tokyo

Botany 2022
https://joelnitta.github.io/botany_2022_taxastand

Species names are the “glue” that connects datasets

Page (2013)

Synonyms break linkages

In the age of big data, software is needed to resolve taxonomy

Shortcomings of current approaches

  • Many tools only available via an online interface (API)
    • Difficult to reproduce
  • Limited number of reference databases to choose from
    • May not be able to implement taxonomy of choice
  • Existing tools do not recognize the rules of taxonomic nomenclature
    • May not be able to accurately match names

Features of taxastand

  • Run locally in R
  • Allows usage of a custom reference database
  • Supports fuzzy matching
  • Understands taxonomic rules

Available at https://github.com/joelnitta/taxastand

Usage

Installation

In R:

# Install remotes first, then install taxastand from GitHub
install.packages("remotes")
remotes::install_github("joelnitta/taxastand")
library(taxastand)
library(dplyr) # provides glimpse(), used to display results

taxastand also requires either taxon-tools or Docker to be installed

Fuzzy matching

res <- ts_match_names(
    query = "Crepidomanes minutus",
    reference = c(
      "Crepidomanes minutum",
      "Hymenophyllum polyanthos"),
    simple = TRUE,
    docker = TRUE
    )
glimpse(res)
Rows: 1
Columns: 3
$ query      <chr> "Crepidomanes minutus"
$ reference  <chr> "Crepidomanes minutum"
$ match_type <chr> "auto_fuzzy"
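
To illustrate the idea behind fuzzy matching, here is a minimal base R sketch that picks the closest reference name by Levenshtein edit distance. It is not how taxastand works internally (taxastand delegates matching to taxon-tools); `fuzzy_match()` and its `max_dist` threshold are hypothetical names chosen for this example.

```r
# Hypothetical helper, not part of taxastand:
# return the nearest reference name for each query by edit distance.
fuzzy_match <- function(query, reference, max_dist = 2) {
  # adist() gives a matrix of Levenshtein distances
  # (rows: query names, columns: reference names)
  d <- utils::adist(query, reference)
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(query), best)]
  data.frame(
    query = query,
    # Only accept the closest name if it is within the allowed distance
    reference = ifelse(best_dist <= max_dist, reference[best], NA_character_),
    distance = best_dist,
    stringsAsFactors = FALSE
  )
}

res <- fuzzy_match(
  query = "Crepidomanes minutus",
  reference = c("Crepidomanes minutum", "Hymenophyllum polyanthos")
)
```

A plain edit-distance threshold knows nothing about authors or ranks, which is why rule-aware matching is treated separately.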

Matching based on taxonomic rules

res <- ts_match_names(
    query = "Crepidomanes minutum K. Iwats.",
    reference = c(
      "Crepidomanes minutum (Bl.) K. Iwats.",
      "Hymenophyllum polyanthos (Sw.) Sw."),
    simple = TRUE,
    docker = TRUE
    )
glimpse(res)
Rows: 1
Columns: 3
$ query      <chr> "Crepidomanes minutum K. Iwats."
$ reference  <chr> "Crepidomanes minutum (Bl.) K. Iwats."
$ match_type <chr> "auto_basio-"
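
In botanical names, the author in parentheses is the author of the basionym and the author after it made the current combination. The query above omits the parenthetical author but is still matched. The sketch below shows one way such a rule could be expressed in base R, assuming simple binomials of the form "Genus epithet (Basionym author) Combining author". `parse_name()`, `compare_names()`, and the returned labels are hypothetical; they are not the taxon-tools implementation or its match codes.

```r
# Hypothetical helper: split a binomial into canonical name,
# basionym author (in parentheses), and combining author.
parse_name <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  canonical <- paste(words[1:2], collapse = " ")
  authors <- trimws(sub(canonical, "", x, fixed = TRUE))
  has_basio <- grepl("^\\(", authors)
  basio <- if (has_basio) {
    sub("^\\(([^)]*)\\).*$", "\\1", authors)
  } else {
    NA_character_
  }
  combining <- trimws(sub("^\\([^)]*\\)", "", authors))
  list(
    canonical = canonical,
    basionym_author = basio,
    combining_author = combining
  )
}

# Hypothetical comparison: tolerate a missing basionym author in the query
compare_names <- function(query, reference) {
  q <- parse_name(query)
  r <- parse_name(reference)
  if (!identical(q$canonical, r$canonical)) return("no_match")
  if (!identical(q$combining_author, r$combining_author)) {
    return("author_mismatch")
  }
  if (identical(q$basionym_author, r$basionym_author)) return("exact")
  if (is.na(q$basionym_author)) return("basionym_author_missing")
  "author_mismatch"
}

match_1 <- compare_names(
  "Crepidomanes minutum K. Iwats.",
  "Crepidomanes minutum (Bl.) K. Iwats."
)
match_2 <- compare_names(
  "Crepidomanes minutum K. Iwats.",
  "Hymenophyllum polyanthos (Sw.) Sw."
)
```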

Name resolution requires a reference database

data(filmy_taxonomy)
head(filmy_taxonomy[c("taxonID", "acceptedNameUsageID",
  "taxonomicStatus", "scientificName")])
# A tibble: 6 × 4
   taxonID acceptedNameUsageID taxonomicStatus scientificName                   
     <dbl>               <dbl> <chr>           <chr>                            
1 54115096                  NA accepted name   Cephalomanes atrovirens Presl    
2 54133783            54115097 synonym         Trichomanes crassum Copel.       
3 54115097                  NA accepted name   Cephalomanes crassum (Copel.) M.…
4 54133784            54115098 synonym         Trichomanes densinervium Copel.  
5 54115098                  NA accepted name   Cephalomanes densinervium (Copel…
6 54133785            54115099 synonym         Trichomanes infundibulare Alderw.

Where to get taxonomic data?

Name resolution

res <- ts_resolve_names(
  query = "Gonocormus minutum",
  ref_taxonomy = filmy_taxonomy,
  docker = TRUE)
glimpse(res)
Rows: 1
Columns: 6
$ query           <chr> "Gonocormus minutum"
$ resolved_name   <chr> "Crepidomanes minutum (Bl.) K. Iwats."
$ matched_name    <chr> "Gonocormus minutus (Bl.) Bosch"
$ resolved_status <chr> "accepted name"
$ matched_status  <chr> "synonym"
$ match_type      <chr> "auto_fuzzy"
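
Once a query has been matched to a name in the reference, resolution amounts to following `acceptedNameUsageID` back to the `taxonID` of the accepted name. The base R sketch below illustrates that lookup on four rows taken from the filmy fern data shown in this talk. `resolve_name()` is a hypothetical helper, not the code behind `ts_resolve_names()`.

```r
# Four rows in Darwin Core format (IDs stored as character)
tax <- data.frame(
  taxonID = c("54133783", "54115097", "54133784", "54115098"),
  acceptedNameUsageID = c("54115097", NA, "54115098", NA),
  taxonomicStatus = c("synonym", "accepted name", "synonym", "accepted name"),
  scientificName = c(
    "Trichomanes crassum Copel.",
    "Cephalomanes crassum (Copel.) M. G. Price",
    "Trichomanes densinervium Copel.",
    "Cephalomanes densinervium (Copel.) Copel."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the accepted name for an already-matched name
resolve_name <- function(matched_name, tax) {
  i <- match(matched_name, tax$scientificName)
  if (is.na(i)) return(NA_character_)
  accepted_id <- tax$acceptedNameUsageID[i]
  # Accepted names have no acceptedNameUsageID and resolve to themselves
  if (is.na(accepted_id)) return(tax$scientificName[i])
  tax$scientificName[match(accepted_id, tax$taxonID)]
}

resolved <- resolve_name("Trichomanes crassum Copel.", tax)
```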

dwctaxon

https://github.com/joelnitta/dwctaxon

Goal

  • Enable simple, error-free editing of Darwin Core (DwC) taxon data

Example: filmy ferns

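library(dwctaxon) # provides dct_filmies
library(dplyr)    # provides filter()
library(stringr)  # provides str_detect()

filmies <- head(dct_filmies) |>
  filter(str_detect(scientificName, "crassum|densinervium"))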

filmies
# A tibble: 4 × 4
  taxonID  acceptedNameUsageID taxonomicStatus scientificName                   
  <chr>    <chr>               <chr>           <chr>                            
1 54133783 54115097            synonym         Trichomanes crassum Copel.       
2 54115097 <NA>                accepted name   Cephalomanes crassum (Copel.) M.…
3 54133784 54115098            synonym         Trichomanes densinervium Copel.  
4 54115098 <NA>                accepted name   Cephalomanes densinervium (Copel…

Changing taxonomy is complicated

Old version:

  • Accepted species 1: Cephalomanes crassum
    • Synonym: Trichomanes crassum
  • Accepted species 2: Cephalomanes densinervium
    • Synonym: Trichomanes densinervium

New version (C. crassum → synonym of C. densinervium):

  • Accepted species: Cephalomanes densinervium
    • Synonym 1: Cephalomanes crassum
    • Synonym 2: Trichomanes crassum
    • Synonym 3: Trichomanes densinervium

Need to account for all synonyms
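
Done by hand, the change described here touches more than one row: the demoted name gets a new status and a new `acceptedNameUsageID`, and every name that pointed to it must be re-pointed as well. A minimal base R sketch of those two steps on the four filmy fern rows follows; it illustrates the bookkeeping only and is not dwctaxon code.

```r
tax <- data.frame(
  taxonID = c("54133783", "54115097", "54133784", "54115098"),
  acceptedNameUsageID = c("54115097", NA, "54115098", NA),
  taxonomicStatus = c("synonym", "accepted name", "synonym", "accepted name"),
  scientificName = c(
    "Trichomanes crassum Copel.",
    "Cephalomanes crassum (Copel.) M. G. Price",
    "Trichomanes densinervium Copel.",
    "Cephalomanes densinervium (Copel.) Copel."
  ),
  stringsAsFactors = FALSE
)

old_id <- tax$taxonID[
  tax$scientificName == "Cephalomanes crassum (Copel.) M. G. Price"
]
new_id <- tax$taxonID[
  tax$scientificName == "Cephalomanes densinervium (Copel.) Copel."
]

# Step 1: demote the formerly accepted name
is_old <- tax$taxonID == old_id
tax$taxonomicStatus[is_old] <- "synonym"
tax$acceptedNameUsageID[is_old] <- new_id

# Step 2: re-point every synonym that still refers to the demoted name
points_to_old <- !is.na(tax$acceptedNameUsageID) &
  tax$acceptedNameUsageID == old_id
tax$acceptedNameUsageID[points_to_old] <- new_id
```

Forgetting step 2 leaves Trichomanes crassum pointing to a name that is no longer accepted, which is the kind of error this package aims to prevent.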

dct_change_status() handles synonym mapping

dct_change_status(
  tax_dat = filmies,
  sci_name = "Cephalomanes crassum (Copel.) M. G. Price",
  new_status = "synonym",
  usage_name = "Cephalomanes densinervium (Copel.) Copel."
)
# A tibble: 4 × 4
  taxonID  acceptedNameUsageID taxonomicStatus scientificName                   
  <chr>    <chr>               <chr>           <chr>                            
1 54133783 54115098            synonym         Trichomanes crassum Copel.       
2 54115097 54115098            synonym         Cephalomanes crassum (Copel.) M.…
3 54133784 54115098            synonym         Trichomanes densinervium Copel.  
4 54115098 <NA>                accepted name   Cephalomanes densinervium (Copel…

dct_validate() checks taxonomic data

dct_change_status(
  tax_dat = filmies,
  sci_name = "Trichomanes crassum Copel.",
  new_status = "synonym",
  usage_name = "Trichomanes densinervium Copel."
) |>
dct_validate()
Error: `check_mapping` failed.
`taxonID`(s) detected whose `acceptedNameUsageID` value does not map to
`taxonID` of an existing name.
Bad `taxonID`: 54133783
Bad `scientificName`: Trichomanes crassum Copel.
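
The error message refers to a check that every `acceptedNameUsageID` maps to the `taxonID` of an existing name. The base R sketch below shows what such a mapping check could look like. `check_mapping_sketch()` and the toy IDs and names are made up for illustration; this is not the implementation of `dct_validate()`, which runs further checks.

```r
# Hypothetical check: find rows whose acceptedNameUsageID
# does not match any taxonID in the same table.
check_mapping_sketch <- function(tax) {
  has_usage <- !is.na(tax$acceptedNameUsageID)
  bad <- has_usage & !(tax$acceptedNameUsageID %in% tax$taxonID)
  tax[bad, c("taxonID", "scientificName")]
}

# Toy data (made up): row 3 points to an ID that does not exist
tax <- data.frame(
  taxonID = c("1", "2", "3"),
  acceptedNameUsageID = c(NA, "1", "99"),
  scientificName = c("Accepted name", "Good synonym", "Orphaned synonym"),
  stringsAsFactors = FALSE
)

bad <- check_mapping_sketch(tax)
```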

Putting it all together (with |>)

ferns_tax_raw |>
  # Add entry for Dryopteris simasakii var. simasakii autonym
  dct_add_row(
    sci_name = "Dryopteris simasakii var. simasakii",
    taxonomicStatus = "accepted",
    taxonRank = "variety",
    parentNameUsageID = "37XPH"
  ) |>
  # Change status of Parahemionitis arifolia as indicated by plastome data
  dct_change_status(
    sci_name = "Parahemionitis arifolia (Burm. fil.) Panigrahi",
    new_status = "accepted"
  ) |>
  dct_change_status(
    sci_name = "Hemionitis arifolia (Burm. fil.) T. Moore",
    new_status = "synonym",
    usage_name = "Parahemionitis arifolia (Burm. fil.) Panigrahi"
  ) |>
  # ... (other changes)
  dct_validate()

Summary

taxastand allows for reliable, customizable taxonomic resolution

  • Main feature: use of custom taxonomy
    • Advantage: can be adapted to different projects
    • Disadvantage: preparing and maintaining a reference database is not simple

Please choose the tool that works best for you!
(see Grenié et al. 2022)

Acknowledgements

  • Japan Society for the Promotion of Science

  • Members of the Iwasaki lab, The University of Tokyo

  • C. Webb

  • M. Hassler

References

Grenié, M., E. Berti, J. Carvajal‐Quintero, G. M. L. Dädlow, A. Sagouis, and M. Winter. 2022. Harmonizing taxon names in biodiversity data: A review of tools, databases and best practices. Methods in Ecology and Evolution. doi:10.1111/2041-210X.13802.
Page, R. D. M. 2013. BioNames: linking taxonomy, texts, and trees. PeerJ 1:e190.